Abstract. To find spatial data, interested parties have to know where and
how to access existing geospatial web services or geoportals. People in
geographic information communities are aware of such online platforms, but
what about others? Finding relevant spatial data that satisfy the needs of
the requesters is still an issue. In this paper, we propose a method for
documenting spatial data for both human and machine consumption. Through
this method, a model for compiling spatial metadata based on the mapping
between the ISO 19115 spatial metadata standard and Dublin Core is designed.
The ultimate goal of this model is the documentation of spatial data in
HTML for discovery by web search engines, while remaining understandable
to users.

1. Introduction

Geoportals, implemented as part of spatial data infrastructures (SDIs),
allow the online dissemination of spatial data from providers. Geoportals
provide web platforms to search for spatial data and associated metadata
content. However, the discovery of geospatial services and spatial data
remains a challenge (Lopez-Pellicer 2012). Business, legal and technological
barriers result in the invisibility of geospatial web services to well-known
general purpose web search engines. This hinders the visibility of
geoportals and thus the discovery of spatial data. Furthermore, considerable
amounts of online spatial data are of no value due to the lack of supporting
information about them.

Attempts to solve the technological barriers related to the invisibility of
geospatial web services have been channeled towards the development of
focused crawlers with the sole mission of surfacing geospatial web contents
(services and data) behind geoportals. Such efforts are centred on web
crawlers' capabilities to search for geospatial web services, rather than on
the ability of such resources to make themselves visible to web search
engines. In a number of studies, these focused crawlers use web pages listed
by Bing, Google or Yahoo as seeds for further search refinement
(Lopez-Pellicer 2012).

The research reported in this paper is part of an experimental endeavour to
empirically test the discoverability of HTML pages with information about
geographic information resources by general purpose web search engines. It
is motivated by current difficulties in finding spatial data with general
purpose web search engines (Katumba et al. 2012). The test results will lead
to an improved understanding of how to prepare HTML documents (files)
with spatial metadata about geographic information resources on the web.
These documents serve as indices for web search engines and as descriptions
of geospatial data for users. This approach is used by web resource
publishers for general purpose internet search. Its effectiveness has mostly
been reported anecdotally by professionals engaged in search engine
optimisation and in improving the visibility of web resources. However, few
scientific studies, such as the one by Zhang and Dimitroff (2005), have
tested its effectiveness empirically. Applying this technique to spatial
data discovery, through associated spatial metadata published online in HTML
format, distinguishes the study reported in this paper from others. The
study aims to contribute towards an improved understanding of how spatial
metadata should be embedded in HTML pages for better discovery of the
corresponding spatial data by web search engines. In this paper, we describe
how the results of a user study were used to develop a user-centered
methodology for preparing the HTML pages for the first empirical tests.

The remainder of this paper is structured as follows: in Section 2, a survey
of related studies is provided as context for the research. In Section 3, the
methodology is described. In Section 4, initial results are presented and
discussed. Section 5 concludes the paper.



2. Related studies 

In this section, concepts related to web search engines, the deep web,
metadata, search engine optimisation and web search behaviour are discussed.
Emphasis is put on describing how these concepts are relevant in the
context of this research.



2.1. Web search engines and the deep web

Web pages are added to web search engine databases either by crawling and
indexing hypertext links or when information about web pages is submitted
to the web search engine directories (Kumar et al. 2011). Web pages that are
invisible to web search engines are said to be part of the "deep web" because
their contents are hidden behind "query forms" or "web service interfaces"
served by backend databases (Wu et al. 2006). Geoportals fall into this
category because one has to enter query information, e.g. keywords in a text
box, to retrieve relevant spatial contents from a database (Lopez-Pellicer
2012). It is impossible for web search engines to figure out all possible
keywords for these text boxes, because web search engines are only designed
to crawl and index existing web contents.

Ntoulas et al. (2005) proposed a web crawler that automatically queries
hidden web page interfaces served by backend text databases. Madhavan et
al. (2008) described the inclusion of web pages hidden behind HTML forms
served by SQL databases into the Google search engine indexing process. In
order to surface geospatial web services and associated contents,
Lopez-Pellicer (2012) proposed the implementation of a crawler focused on
the discovery of OGC web services.

The literature suggests that efforts towards solving the technical problems
related to the invisibility of deep web contents in general, and geospatial
resources in particular, have concentrated on the design and implementation
of focused crawlers. Such attempts require considerable amounts of
computational resources to match well established web search engines.
Furthermore, anyone looking for spatial information has to know about the
existence of such crawlers, which presumably would be hosted inside
geospatial web services or geoportals. Anyone who does not know about the
existence of geoportals in the first place cannot use these search engines
(crawlers) to search for spatial information. Therefore, general purpose web
search engines are the most suitable option for discovering geospatial web
services and their associated contents.

2.2. Metadata 
2.2.1. Dublin Core 

Even though well-designed web pages, rich in content, are a requirement
for high visibility, metadata content also facilitates the indexing and
discovery of such web resources (Greenberg et al. 2001; Zhang and Dimitroff
2005). Furthermore, the description of web resources is also provided
through metadata content. Hence, a metadata standard that different web
publishers can use for the identification, description and discovery of web
resources is needed. The Dublin Core metadata standard is used as a
reference description for online web resources (Dublin Core 2012). It has
been considered in this study because of its simplicity and ease of use, its
international scope, its extensibility and, most importantly, its generic
terms for web resource description.

2.2.2. Spatial metadata (ISO 19115) 

ISO 19115:2003, Geographic Information - Metadata, provides descriptions
of geographic information and associated services with attributes for the
identification, extent, quality, spatial and temporal schema, spatial
reference and distribution of digital geographic data. ISO 19115:2003 was
considered in this study because most of the existing spatial metadata
standards are profiles (specialisations) of it (Nogueras-Iso et al. 2004).

2.2.3. Mapping between ISO 19115 and Dublin Core 

Since ISO 19115:2003 is a specialised metadata standard and Dublin Core is
more generic and therefore suitable for describing web resources, a mapping
between the two standards is imperative for the accomplishment of
this research. In 2003, members of the European Committee for
Standardisation Workshop Agreement (CWA) produced a consensus-based
specification "CWA 14857:2003 (E)" that defines the mapping between Dublin
Core and ISO 19115:2003. The principal aim was to enhance the discovery of
geographic information in "cross metadata searches" (CEN 2003). Since
then, it has only been used for the design and implementation of tools
for automatic conversion from one standard to another (Nogueras-Iso et al.
2004). In this study we use this mapping to prepare web resources (HTML
pages) with spatial metadata that are discoverable by general purpose web
search engines. We also describe how spatial metadata is integrated into
web resources (HTML pages) based on the CWA 14857:2003 (E) specification.

2.3. Search engine optimisation (SEO) 

Since its inception, the World Wide Web has seen a growing number of web
search engine users from a variety of disciplines, as well as the
proliferation of web sites with web page contents changing constantly. As a
result, user satisfaction with web searches has deteriorated. This prompted
information retrieval experts to research web search engine performance and
users' web search behaviour. The performance of a web search engine is
measured by the number of relevant web sites it lists. It should not take
users much time and effort to get relevant web sites in response to their
queries. This is a concern for web publishers who want their web sites to be
visible on web search engine listings. Among many other considerations,
higher web page visibility in response to relevant user queries relies on
SEO techniques employed by web page designers and publishers. These
techniques are meant to maximise the indexing of web sites (pages) by search
engines. Elements that impact web page visibility are related to the
metadata structure of the page, its content and the number of hyperlinks
pointing to it (Zhang and Dimitroff 2005). The tuning of these elements
through SEO contributes to the visibility of web pages. However, web
publishers can only tune some of these. The number of hyperlinks and the way
in which users refine their query terms are completely out of the web
publisher's control (Greenberg 2001). Nevertheless, users' web search
behaviour in terms of keyword selection can be monitored to harvest the kind
of keywords that users employ when searching for information on a particular
topic. With respect to this research, harvested keywords serve as guidance
in the design of web page contents and associated metadata.

2.4. Web search behaviour 

After analysing a large-scale log of web search behaviour, White et al.
(2009) concluded that domain expertise greatly influences user behaviour
and success when querying and retrieving information on web search
engines. They evaluated domain expertise by comparing the vocabulary
(textual queries) of participants with domain-specific lexicons. Other
studies with smaller numbers of participants corroborate the findings of
White et al. (2009): user domain expertise plays a major role in web search
engine information retrieval (Holscher et al. 2000).



3. Research design 
3.1. Approach 

Our approach differs from the literature described in section 2 since it
seeks to study and exploit the capability of making web resources with
information about spatial data visible to well known web search engines. We
propose that spatial data be documented in HTML in such a way that it can
easily be indexed by general purpose web search engines. This is done by
compiling spatial metadata contents as HTML documents which are crawlable
by web search engines. This enables the discovery of spatial data without
the user having to know about existing geoportals with spatial contents
that are invisible to web search engines in the first place. We base this
exercise on search engine optimisation (SEO) techniques applicable to the
contents of HTML web pages and associated tags such as the "meta" tag. A
mapping from the ISO 19115:2003 standard to Dublin Core is used to produce
the end result as HTML web pages.



Figure 1 puts our research into context through a model that describes the
spatial data search flow of a typical web user. First, the user uses
a general purpose web search engine to locate information about spatial
data. Next, the user enters the geoportal to view the spatial data and
evaluates the metadata in order to decide whether it can be used for his/her
purpose. Our research is concerned with the outer layer in Figure 1.




Figure 1. Description of the study context



3.2. Methodology for the research described in this paper 

Since users drive online searches for spatial data, it was imperative to
come up with a user-centered methodology for preparing the HTML
pages for the empirical tests. Furthermore, the findings of the related work
reported in section 2.4 suggest that user domain expertise plays a major
role in web search engine information retrieval, irrespective of the number
of users that participate in a given web search experiment. Therefore, we
conducted a user study to determine which keywords users employ to
search for geospatial data. Keywords collected from the user study were
qualitatively analysed based on occurrence patterns in different
participants' query terms. The analysis of these keywords facilitated the
design of the model that defines the process of compiling spatial metadata
as web resources (HTML pages).



3.2.1. Keyword experiment (user study) 

A group of 17 BSc Geoinformatics students in their final year (third year) of
studies participated in an experiment of two hours. They were required to
search the web for spatial data to address a particular problem of their
choice. They were advised to use any knowledge they had acquired in their
studies. Their keyword (textual query) selection was carefully monitored
using the Mozilla Firefox web browser cache. A qualitative analysis of the
collected keywords was performed to identify keywords used by all
participants, excluding those specific to each individual's topic (domain)
of interest. Commonality of keywords was considered because it reflects user
behaviour in terms of keyword (or textual query) selection.

4. Results 

4.1. Spatial metadata compilation model (SMCM) 

Selected search terms used by experiment participants are given in Table 1.



Participant ID   Search terms
--------------   --------------------------------------------------
P1               rwanda spatial data sets shapefile
P2               Ghana land cover shape file
P3               south africa vegetation cover spatial data
P4               spatial data of electricity distribution in rwanda
P5               raster data of south African power plan
P6               east african rift system formation in kenya
P7               land+use+data+sets+limpopo
P8               esri+shapefiles+schools
P9               Spatial data for nile river
P10              "Botswana", "water bodies",
P11              Cairo roads spatial data free
P12              mount kilimanjaro shapefile download

Table 1. Sample of keywords used by participants
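
As a rough illustration of the commonality analysis, the sketch below counts how many participants used each term in the Table 1 sample. The analysis in this study was qualitative; the tokenisation rule and the names `tokenize` and `participant_counts` are assumptions made for this example and are not part of the study.

```python
import re
from collections import Counter

# Sample queries copied from Table 1 (participant ID -> search terms).
QUERIES = {
    "P1": "rwanda spatial data sets shapefile",
    "P2": "Ghana land cover shape file",
    "P3": "south africa vegetation cover spatial data",
    "P4": "spatial data of electricity distribution in rwanda",
    "P5": "raster data of south African power plan",
    "P6": "east african rift system formation in kenya",
    "P7": "land+use+data+sets+limpopo",
    "P8": "esri+shapefiles+schools",
    "P9": "Spatial data for nile river",
    "P10": '"Botswana", "water bodies",',
    "P11": "Cairo roads spatial data free",
    "P12": "mount kilimanjaro shapefile download",
}


def tokenize(query):
    """Lower-case a query and split it on anything that is not a letter
    (an assumed, deliberately simple tokenisation rule)."""
    return re.findall(r"[a-z]+", query.lower())


def participant_counts(queries):
    """For each term, count how many participants used it at least once."""
    counts = Counter()
    for text in queries.values():
        # A set ensures that a participant is counted once per term.
        counts.update(set(tokenize(text)))
    return counts


counts = participant_counts(QUERIES)
```

On this sample, the two most widely shared terms are "data" (7 participants) and "spatial" (5 participants), while place names and topics vary from one participant to the next.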



A qualitative analysis of participants' keyword selection suggests that
spatial data search terms can be classified into three main categories as
described in Figure 2.

• Topic: defines the subject of interest for which spatial data is being
searched. Examples are: infrastructures, urban planning, water
resources, and environmental management.

• Location: defines the spatial extent or geographic coverage of the
spatial data being searched. Keywords related to the location are
usually place names of geographic areas, e.g. Africa or Johannesburg.

• Geographic feature: defines the actual geographic object or feature
of interest to the user. There are two sub-categories with respect to
the two main spatial data models, namely vector and raster. Under
the vector model, keywords are based on the geographic feature
primitives such as point, line or polygon. "Shapefile" was also used as
a keyword under the vector category, since it is a well-known format for
vector (spatial) data. The keyword "raster" is useful when looking
for continuous geographic features.
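
The three categories can be sketched as a simple term classifier. This is a minimal sketch and not the procedure used in the study: the lexicons `LOCATION_TERMS`, `FEATURE_TERMS` and `GENERIC_TERMS` and the function `classify` are hypothetical, and treating every remaining term as a topic term is an assumption made for illustration.

```python
import re

# Hypothetical lexicons for illustration only; a real system would need a
# gazetteer for place names and a fuller list of data model/format terms.
LOCATION_TERMS = {"rwanda", "ghana", "kenya", "limpopo", "cairo", "botswana"}
FEATURE_TERMS = {"shapefile", "shapefiles", "raster", "vector",
                 "point", "line", "polygon"}
# Generic terms that do not indicate a topic, location or feature type.
GENERIC_TERMS = {"spatial", "data", "sets", "of", "in", "for",
                 "free", "download"}


def classify(query):
    """Assign each term of a query to one of the three SMCM categories."""
    result = {"topic": [], "location": [], "feature": []}
    for term in re.findall(r"[a-z]+", query.lower()):
        if term in GENERIC_TERMS:
            continue
        if term in LOCATION_TERMS:
            result["location"].append(term)
        elif term in FEATURE_TERMS:
            result["feature"].append(term)
        else:
            # Assumption: anything left over describes the topic.
            result["topic"].append(term)
    return result


p11 = classify("Cairo roads spatial data free")
p1 = classify("rwanda spatial data sets shapefile")
```

With these lexicons, the query of participant P11 yields "roads" as topic and "cairo" as location, and the query of participant P1 yields "rwanda" as location and "shapefile" as feature type.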




[Figure 2: diagram of the three keyword categories with example values.
Topic/Subject: land cover/land use, infrastructures, water resources.
Location/Spatial extent: Africa, France, Sub-Saharan region.
Geographic feature type: vector (points, lines, polygons), raster,
shapefile/map.]

Figure 2. Spatial metadata compilation model diagram

The three main categories of keywords in the proposed model describe how
well users' keywords can be matched to the elements of the mapping
between the ISO 19115 and Dublin Core standards. This concept (model)
facilitates the preparation of spatial metadata for online publication, not
only for GIS professionals who are accustomed to ISO 19115 but for anyone
who uses Dublin Core.

4.2. Mapping SMCM categories to Dublin Core via ISO 19115 

The SMCM categories were mapped to corresponding core elements of ISO
19115. This mapping facilitates compiling spatial metadata contents based
on the ISO 19115:2003 metadata standard. The core elements of ISO 19115
allow an easy understanding of geographic data by both consumers and
producers (Nogueras-Iso et al. 2004). Moving from ISO 19115 to Dublin Core
was the ultimate task of this exercise, since Dublin Core is the de facto
standard for the description and discovery of web resources (HTML web
pages). Table 2 describes the mapping process.



SMCM category            Corresponding ISO 19115 core elements          Corresponding Dublin
                                                                        Core elements
-----------------------  ---------------------------------------------  --------------------
Topic/Subject            Dataset title                                  TITLE
                         Dataset topic category                         SUBJECT
                         Abstract describing the datasets               DESCRIPTION

Location/Spatial Extent  Geographic location of the dataset (by four    COVERAGE: SPATIAL
                         coordinates or geographic identifier)
                         Additional extent information for the dataset  COVERAGE: TEMPORAL
                         (vertical and temporal)

Geographic Feature Type  Spatial representation                         TYPE
                         Distribution format                            FORMAT
                         Lineage                                        SOURCE

                         Dataset responsible party                      CREATOR, PUBLISHER
                         Dataset reference date                         DATE
                         On-line resource                               IDENTIFIER
                         Dataset Language                               LANGUAGE

Table 2. Mapping from SMCM to Dublin Core via ISO 19115
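
For tooling purposes, the rows of Table 2 could be held in a small lookup structure. The sketch below is one possible encoding, not part of the CWA specification or of this study's implementation: the names `TABLE_2` and `dc_elements_for`, the lower-case element strings, and the use of `None` for the rows listed without an SMCM category are all assumptions.

```python
# Each entry: (SMCM category, ISO 19115 core element, Dublin Core elements).
# None marks rows that Table 2 lists without an SMCM category (assumption).
TABLE_2 = [
    ("Topic/Subject", "Dataset title", ["title"]),
    ("Topic/Subject", "Dataset topic category", ["subject"]),
    ("Topic/Subject", "Abstract describing the datasets", ["description"]),
    ("Location/Spatial Extent", "Geographic location of the dataset",
     ["coverage:spatial"]),
    ("Location/Spatial Extent", "Additional extent information for the dataset",
     ["coverage:temporal"]),
    ("Geographic Feature Type", "Spatial representation", ["type"]),
    ("Geographic Feature Type", "Distribution format", ["format"]),
    ("Geographic Feature Type", "Lineage", ["source"]),
    (None, "Dataset responsible party", ["creator", "publisher"]),
    (None, "Dataset reference date", ["date"]),
    (None, "On-line resource", ["identifier"]),
    (None, "Dataset Language", ["language"]),
]


def dc_elements_for(category):
    """Return the Dublin Core elements reachable from one SMCM category."""
    return [dc
            for smcm, _iso_element, dc_list in TABLE_2
            if smcm == category
            for dc in dc_list]


topic_elements = dc_elements_for("Topic/Subject")
feature_elements = dc_elements_for("Geographic Feature Type")
```

Keeping the ISO 19115 element in each row preserves the two-step nature of the mapping (SMCM to ISO 19115, then ISO 19115 to Dublin Core) instead of collapsing it into a direct SMCM to Dublin Core table.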



4.3. Application example 

A practical example of the proposed model (SMCM) is provided to illustrate
its application. The example describes a scenario illustrating the
documentation of spatial data appropriate for discovery by participant P11,
whose topic for spatial data discovery was "Cairo roads in Egypt". A dataset
with the title "Egypt - Roads" on the "FAO Africover" website (see footnote
1) was used, because it contains detailed metadata. Figure 3 shows an extract
of the metadata web page for the spatial dataset considered.









[Figure 3: screenshot of the FAO Africover metadata web page for the
dataset. Legible fields include: Organisation: FAO - Africover; Title:
Egypt - Roads (Africover); Presentation format: mapDigital; Theme Keywords:
orientation, roads; Place Keyword: Egypt; ISO Topic Category: Earth Cover;
Direct Spatial Reference Method: Vector; Data Format: ESRI ArcView
Shapefile (.shp); Dataset Language: English; Dataset Character Set:
usAscii.]

Figure 3. FAO Africover metadata file of Egypt-Roads dataset



1 FAO Africover project, http://www.africover.org/



The application of the proposed method for preparing spatial metadata for
insertion in HTML pages is described as follows:



SMCM category and value    ISO 19115 elements and values
-------------------------  ----------------------------------------------------
Topic:                     Dataset title: Egypt - Roads
Cairo roads in Egypt       Dataset topic category: Roads Network of Egypt
                           Abstract describing the datasets: The roads of Egypt
                           have been produced from visual interpretation of
                           digitally enhanced LANDSAT TM images (Bands 4,3,2)
                           acquired mainly in the year 1997.

Location:                  Geographic Location of the dataset: Cairo, EGYPT
Cairo, Egypt

Geographic Feature types:  Data Format: ESRI ArcView Shapefile (.shp)
Roads                      Lineage: The roads have been produced from visual
                           interpretation of digitally enhanced LANDSAT TM
                           images (Bands 4,3,2) acquired mainly in the year
                           1997.
                           Spatial representation: Vector

Table 3. Mapping from SMCM to ISO 19115:2003



The final HTML page result in Figure 4 illustrates how the different field
elements of the Dublin Core standard are filled using the HTML "meta" tag
based on the mapping from ISO 19115 to Dublin Core. Web search engines and
their crawlers can index web resources (HTML documents) prepared in this way
for optimum spatial data (metadata) discovery, because web search engines
are best at discovering HTML pages.



<link rel="schema.DC" href="http://purl.org/dc/elements/1.1/" />
<link rel="schema.DCTERMS" href="http://purl.org/dc/terms/" />
<meta name="DC.title" lang="English" content="Egypt - Roads" />
<meta name="DC.creator" content="FAO AFRICOVER" />
<meta name="DC.subject" lang="English" content="Roads Network of Egypt" />
<meta name="DC.publisher" content="FAO AFRICOVER" />
<meta name="DC.description" content="The roads of Egypt have been produced from visual interpretation of digitally enhanced LANDSAT TM images (Bands 4,3,2) acquired mainly in the year 1997." />
<meta name="DC.date" content="2012-11-09" />
<meta name="DC.type" content="Vector" />
<meta name="DC.format" content="ESRI ArcView Shapefile (.shp)" />
<meta name="DC.identifier" scheme="DCTERMS.URI" content="http://www.africover.org" />
<meta name="DC.language" scheme="DCTERMS.URI" content="English" />
<meta name="DC.coverage" scheme="DCTERMS.URI" content="Cairo; Egypt; Northern Africa; Africa" />
<meta name="DC.rights" scheme="DCTERMS.URI" content="Copyright, FAO AFRICOVER 2012 All rights reserved" />

Figure 4. Dublin core metadata contents of the final HTML page result 
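
Markup of the kind shown in Figure 4 could be generated from a metadata record, and read back roughly as an indexing crawler might. The sketch below uses only the Python standard library; the function `dublin_core_tags`, the class `DublinCoreReader` and the reduced record are hypothetical, and the sketch omits the `lang` and `scheme` attributes used in Figure 4. It does not claim to reproduce how any particular search engine processes these tags.

```python
from html import escape
from html.parser import HTMLParser

# Schema declarations as used in Figure 4.
DC_LINKS = [
    '<link rel="schema.DC" href="http://purl.org/dc/elements/1.1/" />',
    '<link rel="schema.DCTERMS" href="http://purl.org/dc/terms/" />',
]


def dublin_core_tags(record):
    """Render a dict of Dublin Core element -> value as HTML head lines."""
    lines = list(DC_LINKS)
    for element, value in record.items():
        # Escape the value so quotes or ampersands cannot break the attribute.
        lines.append('<meta name="DC.{}" content="{}" />'.format(
            element, escape(value, quote=True)))
    return "\n".join(lines)


class DublinCoreReader(HTMLParser):
    """Collect DC.* meta tags from a page (hypothetical crawler-side view)."""

    def __init__(self):
        super().__init__()
        self.found = {}

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        name = attributes.get("name") or ""
        if tag == "meta" and name.startswith("DC."):
            self.found[name[3:]] = attributes.get("content", "")


# Values taken from Table 3 and Figure 4 (a reduced record for brevity).
record = {
    "title": "Egypt - Roads",
    "creator": "FAO AFRICOVER",
    "subject": "Roads Network of Egypt",
    "type": "Vector",
    "format": "ESRI ArcView Shapefile (.shp)",
    "coverage": "Cairo; Egypt; Northern Africa; Africa",
}

head = dublin_core_tags(record)
reader = DublinCoreReader()
reader.feed(head)
```

Reading the generated markup back with `DublinCoreReader` returns the original record, which gives a simple consistency check before a page is published.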

5. Conclusion

We described how the results of a user study guided the development of a
user-centered methodology for adding spatial metadata to HTML pages.
These pages will be used in first empirical tests of the effectiveness of
search engine optimisation. The proposed method is based on a model (SMCM)
designed with input from an analysis of keywords obtained from an
experiment in which users had to search for spatial data on the web. A
mapping between ISO 19115:2003 spatial metadata and Dublin Core was used to
prepare spatial metadata about geospatial resources for enhanced discovery
by general purpose web search engines. The empirical tests are currently in
progress and first results will be reported at the conference. First
indications are that the pages are discoverable by general purpose web
search engines. We plan to refine the SMCM model with results from
additional studies with different user groups. This work can be extended by
designing a tool to automate the process of spatial data documentation
following the proposed spatial metadata compilation model. Further empirical
tests will be done by submitting HTML web pages obtained from the proposed
methodology to web search engine directories. Subsequently, the position of
these HTML web pages in web search engine page rankings will be evaluated.